Deepfake Detection Challenge: A bronze solution (for now)
A bronze medal solution for Kaggle Deepfake Detection Challenge
Just about 4 months ago Kaggle started hosting a very interesting competition with a prize pool of $1,000,000: the Deepfake Detection Challenge. Although it is very tempting to chase that kind of prize, for me Kaggle competitions are always about learning. Unfortunately, I joined the competition pretty late, with only about a month left, but I still tried to give 100% to see how much I could achieve and learn. The competition has now ended, but the final results on the Private Leaderboard will be revealed once the participants' models are evaluated on a hold-out set by Facebook.
AWS, Facebook, Microsoft, the Partnership on AI's Media Integrity Steering Committee and academics came together to build this challenge by providing a dataset of ~100K (~500 GB) real and fake videos. First, I would like to thank all these organizations and individuals for creating this challenge, and Kaggle for hosting it, letting talented people in the field work on such an important problem for our society.
Without a doubt, deepfakes and similar adversarial content generation and manipulation methods are a great threat to everyone. They can have significant implications for the quality of public discourse and the safeguarding of human rights. Misinformation can lead to dangerous, even fatal outcomes. These kinds of threats appear not only in computer vision but also in NLP. For example, OpenAI's gigantic GPT-2 model raised similar concerns about adversarial risks, and for this very reason the full trained model was initially withheld for some time.
"These samples have substantial policy implications: large language models are becoming increasingly easy to steer towards scalable, customized, coherent text generation, which in turn could be used in a number of beneficial as well as malicious ways".
We're releasing the 1.5billion parameter GPT-2 model as part of our staged release publication strategy.
— OpenAI (@OpenAI) November 5, 2019
- GPT-2 output detection model: https://t.co/PX3tbOOOTy
- Research from partners on potential malicious uses: https://t.co/om28yMULL5
- More details: https://t.co/d2JzaENiks
As AI techniques evolve, people will not stop using them for harmful purposes. But this should not impose barriers on technological development; rather, it should encourage everyone in the community to contribute to the fight against misuse through transparency.
Those who are further interested in, and concerned about, the implications of AI for ethics and society can check out Fast.ai's ai-in-society blog series.
The goal of the competition was to detect real and fake videos, which can be framed simply as a binary video classification task. The provided training dataset had only a binary fakeness indicator for each video and no other information. It was also stated that fake videos could have visual manipulations, audio manipulations, or both. The technical details of the manipulation methods were not publicly disclosed, so as not to defeat the purpose of building a robust and general deepfake detection model.
Log loss was selected as the evaluation metric. It can be considered a better choice than accuracy, since it also captures the confidence of the predictions.
$$\textrm{LogLoss} = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right]$$
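For intuition, the metric is easy to compute directly. A minimal NumPy sketch (the `eps` clipping is my addition, to avoid an infinite loss when a confident prediction is exactly wrong):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    "Binary log loss; predictions are clipped so log(0) never occurs"
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# A confident wrong answer is punished far more than a cautious one
print(log_loss([1, 0], [0.9, 0.1]))    # ~0.105
print(log_loss([1, 0], [0.01, 0.99]))  # ~4.605
```

Note how two predictions with the same accuracy (zero and two mistakes at a 0.5 threshold would both be possible here) get wildly different losses depending on confidence; this is why clipping extreme probabilities matters later in the solution.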
The dataset was created with the help of volunteer actors and their self-recorded videos. Each of these videos was then manipulated with different deepfake methods, so each original video had multiple fakes, each corresponding to a particular method.
Here, let's display an original video and its fake. In some cases, the differences may be very subtle to the human eye.
Tip: look closely at the mouth and eyes in both videos. You may also notice minor generation artifacts in the video on the right (or bottom): watch how the face tilts as frames change.
After exploring the dataset and reading discussions on the Kaggle forum, it became clearer that the facial manipulation techniques employed varied in quality and localization. It should also be noted that these deepfake methods are far from perfect, which introduces additional noise into the training set. More detailed information on a preview version of the dataset is available in this paper.
Below is an example batch of face manipulations. In these pairs of face crops, the left corresponds to the original and the right corresponds to the fake video frame.

I joined the competition relatively late, but everyone contributing kernels and discussions on Kaggle helped me get up to speed pretty easily. Given that it was my first time working on a video dataset, I had to grasp the important bits of such a problem and quickly determine what to focus on.
In a competition setting it's really helpful to read what others have already found out and build improvements on top of these publicly available ideas. You may also notice that the people in the top ranks are usually the silent ones!
Before doing any coding, my first step was to create a scientific journal and write down the strategies I was going to try, prioritized by their (potential utility) / (cost of development) ratio. Of course, some edits were made to this journal as I worked on the competition, but about 80% of it remained unchanged. For the journal I simply used OneNote, nothing fancy.

I spent roughly 3 weeks on data and 1 week on modeling. All my work is available on GitHub. I used nbdev to speed up code base development with notebooks. I highly recommend it for other competitions if you are more productive with Jupyter notebooks but still need the power of modularizing your code; it can give a 2-3x productivity boost.
The final solution I came up with can be broken into two parts: a training pipeline and an inference pipeline. The training pipeline consisted of extracting faces with a detection model, creating a validation set, a custom batch sampling strategy, regularization and model training. Since this was a Code Competition, it had the following environment constraints for completing inference on ~4000 videos: 9 hours of runtime, no internet access and no more than 1 GB of external data (including pretrained model weights). Optimizing the inference code is therefore really important, and it needs to be thought through from the start, because your final solution may otherwise end up too complex to fit into such a setting. That's why video loading and batching performance was critical for making a successful submission.
As I mentioned earlier, this was my first experience with video datasets, so I read quite a bit about which library to use for fast video loading and batching. It turned out there are a couple of options, such as OpenCV, Decord and Nvidia DALI.
OpenCV can be considered the baseline, as it's the vanilla approach that is most commonly used.
#collapse-show
import cv2

def open_cv_video_reader(path, freq=None):
    "Optionally sample every `freq`-th frame and yield RGB frames from a video"
    vidcap = cv2.VideoCapture(str(path))
    video_len = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
    for i in range(video_len):
        vidcap.grab()  # advance to the next frame without decoding it
        if (freq is None) or (i % freq == 0):
            _, image = vidcap.retrieve()  # decode only the sampled frames
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            yield image
    vidcap.release()
%%timeit
frames = list(open_cv_video_reader('deepfakes/btohlidmru.mp4', 10))
Decord is another library I came across in one of the shared kernels. You will need to build it from source to get the latest improvements, such as the PyTorch bridge. Decord also has a ctx=gpu() option, which is very fast if you have a good GPU, but the GPU version had a memory leak problem that couldn't be solved in time. Still, I believe it's a very promising library and I recommend anyone interested in video deep learning applications to keep an eye on it. Even though Decord looks slower here, with its PyTorch bridge it was the faster option for creating batches on an AWS p3 instance.
#collapse-show
from decord import VideoReader
from decord import cpu
# from decord.bridge import set_bridge
# set_bridge('torch')

def decord_cpu_video_reader(path, freq=None):
    "Optionally sample every `freq`-th frame and return a batch of frames"
    video = VideoReader(str(path), ctx=cpu())
    if freq:
        return video.get_batch(range(0, len(video), freq))
    return video.get_batch(range(len(video)))
%%timeit
frames = decord_cpu_video_reader('deepfakes/btohlidmru.mp4', 10)
The last option I investigated was DALI, but I couldn't find an easy way to use it the way I needed for my pipeline, e.g. reading every 10th frame, so I didn't spend too much time on it.
It was almost certain that all modifications were made on an actor's face, and faces were just a small proportion of each video frame. So it felt like a good idea to extract faces from videos as the first step of data preparation. In theory, this improves the signal-to-noise ratio and lets CNN-based models focus on the right places.
For detection I used the MobileNet-based detector available here. It's a very lightweight model in terms of memory and disk space, and it's also pretty fast during inference. I didn't have time to prepare a better dataset by dealing with the false negatives/positives that detection produces, which might have allowed a further gain in overall model performance. I extracted ~30 frames per video (every 10th frame, at equal intervals) and stored only the faces. Each face crop was enlarged by a factor of 1.3 after detection.
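The box enlargement step can be sketched as follows. This is an illustrative helper (`enlarge_box` is my name, not from the actual repo), assuming the usual convention of expanding the detector box around its center and clamping to the frame bounds:

```python
def enlarge_box(x1, y1, x2, y2, frame_w, frame_h, factor=1.3):
    "Expand a detector box around its center by `factor`, clamped to the frame"
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (max(0, int(cx - half_w)), max(0, int(cy - half_h)),
            min(frame_w, int(cx + half_w)), min(frame_h, int(cy + half_h)))

print(enlarge_box(100, 100, 200, 200, 1920, 1080))  # (85, 85, 215, 215)
```

Enlarging the crop this way keeps some context around the face (hairline, jaw, neck), where blending artifacts of face-swap methods often show up.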
As I mentioned earlier, the total size of the full video dataset was ~500 GB in zip format, but Kaggle also provided a chunked version with 50 parts. Instead of getting an expensive AWS machine with a 1 TB EBS volume, I wrote a script which sequentially downloads a chunk (1 of 50), detects faces using RetinaFace, saves the cropped faces and finally deletes the processed chunk of videos, repeating this for all 50 chunks. This saved a lot of disk space. Even though AWS gave participants a $650 credit, I optimized the code as if I were paying from my own pocket. It's always good to learn and practice these kinds of things when dollars are involved :)
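The download-process-delete loop can be sketched like this. `download_chunk` and `extract_faces` are hypothetical stand-ins for the actual downloading and RetinaFace detection code; the point is the disk-space pattern, not the details:

```python
import shutil
from pathlib import Path

def process_chunks(n_chunks, download_chunk, extract_faces, out_dir):
    "Stream chunks: download one, save its face crops, delete the videos, repeat"
    for i in range(n_chunks):
        chunk_dir = download_chunk(i)         # fetch chunk i to local disk
        for video in Path(chunk_dir).glob('*.mp4'):
            extract_faces(video, out_dir)     # detect faces and save crops
        shutil.rmtree(chunk_dir)              # free disk before the next chunk
```

With this pattern, peak disk usage is one chunk (~10 GB) plus the accumulated face crops, instead of the full 500 GB.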
Discussions were really helpful for deciding on a proper validation set; my thanks to everyone who shared what worked and what didn't. This saved me a lot of time and spared me from methods like grouping based on face embeddings. I finally used chunks 1-40 for training, 41-45 for validation and 46-50 as a test set. Once I settled on the modeling approach, I merged the training and test sets for the final training.
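Splitting by chunk rather than by video keeps actors from leaking across splits, since each chunk contains a disjoint group of actors. The assignment itself is trivial (a sketch, with my own function name):

```python
def chunk_split(chunk_id):
    "Assign a dataset chunk (1-50) to a split by its id"
    if chunk_id <= 40:
        return 'train'
    elif chunk_id <= 45:
        return 'valid'
    return 'test'

splits = {c: chunk_split(c) for c in range(1, 51)}
```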
Once I started training a baseline model I quickly noticed that vanilla random batch sampling wouldn't do well, so I decided on a custom batch sampler. For each batch, this sampler randomly picks n original videos and one random fake for each of them; then, for each real-fake pair, the same frame crop is used. This strategy helped me get a decent public LB score, but the model was still overfitting badly and fast.
What didn't work: I also had another sampler that used each real video together with all of its corresponding fakes, again using the same frame crop for each. This was significantly worse than balanced sampling (1 real - 1 random fake).
Custom Sampler
#collapse
import numpy as np
from torch.utils.data import Sampler

class SingleFrameRealFakeSampler(Sampler):
    "Sample a single random fake for each source, sharing the same crop frame"
    def __init__(self, train_df):
        self.train_df = train_df
        self.unique_originals = np.unique(self.train_df.original.dropna())
        # get fname indexes for sampling
        self.fname2idx = {k: v for v, k in enumerate(self.train_df['fname'])}
        # get source:fakes mapping
        self.source2fakes = create_source2fakes(train_df)
        for k, v in self.source2fakes.items(): assert len(v) > 0
        # convert face crop fnames to array
        self.face_crop_fnames = self.train_df['face_crop_fnames'].values

    def __iter__(self):
        # shuffle original videos
        unique_originals = np.random.permutation(self.unique_originals)
        # collect indexes for source and fake
        all_idxs = []
        for source in unique_originals:
            fake = np.random.choice(self.source2fakes[source])
            source_fname_idx = self.fname2idx[source]
            fake_fname_idx = self.fname2idx[fake]
            # pick a random frame shared by the real-fake pair
            num_frames = len(self.face_crop_fnames[source_fname_idx])
            rand_crop_idx = np.random.choice(range(num_frames))
            # (real video, fake video, random frame)
            all_idxs.append((source_fname_idx, fake_fname_idx, rand_crop_idx))
        return iter(all_idxs)

    def __len__(self):
        # each original yields a (real, fake) pair of samples
        return len(self.unique_originals) * 2
Custom Batch Sampler
#collapse
class SingleFrameRealFakeBatchSampler(Sampler):
    "Batch real-fake pairs from the sampler, applying the same tfms to both"
    def __init__(self, sampler, batch_size, drop_last, tfms=True):
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.tfms = tfms

    def __iter__(self):
        batch = []
        for real_idx, fake_idx, crop_idx in self.sampler:
            # get tfms dict for augmentation
            if self.tfms: tfms_dict = get_tfms_dict(0.2, 0.2, 0.2, 0.2, 0.4)
            else: tfms_dict = None
            batch.append((real_idx, crop_idx, tfms_dict))
            batch.append((fake_idx, crop_idx, tfms_dict))
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size
I think the biggest challenge was avoiding overfitting, and in particular the risk of memorizing individual faces. At first I tried trivial data augmentation strategies.
Crappification: downsampling the image and upsampling it again to mimic low-resolution/compression scenarios.
#collapse-show
def crappify(pilimg):
    "Downsample then upsample to mimic low-resolution/compressed footage"
    pilimg = pilimg.resize((64, 64))
    pilimg = pilimg.resize((224, 224))
    return pilimg
- Left-right flipping
- Brightness change (darker or lighter)
- Random zoom into face crops (x1.15 - x1.35)
- Mixup
- Randmerge
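The random zoom can be sketched as a center crop by a random factor in [1.15, 1.35] (the function name and the NumPy-slicing implementation are mine; in practice the crop would be resized back to 224x224 afterwards):

```python
import numpy as np

def random_zoom_crop(img, zoom_range=(1.15, 1.35)):
    "Zoom into a face crop by taking a smaller center crop; resize-back omitted"
    h, w = img.shape[:2]
    zoom = np.random.uniform(*zoom_range)
    new_h, new_w = int(h / zoom), int(w / zoom)
    top, left = (h - new_h) // 2, (w - new_w) // 2
    return img[top:top + new_h, left:left + new_w]
```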
In just the last 3 days of the competition I added a custom augmentation strategy similar to CutMix, inspired by @godaibo's approach in another competition, which I heard about in his TWIML podcast interview. For a given face crop, I merge it vertically (50% - 50%) with another face crop from the same class. This gave a boost of 0.02 on the public LB. The idea was to prevent the model from memorizing individual faces. The 50% vertical merge was based on the assumption that fake alterations are roughly symmetric, e.g. changes to the eyes, mouth, nose, etc.
#collapse
import PIL
import numpy as np

def rand_pair_merge(crop_fname, label, face_crop_fnames, labels):
    "Merge `crop_fname` with a random face crop that has the same label"
    # pick a random crop with the same label
    rand_pair_idx = np.random.choice(np.where(labels == label)[0])
    rand_pair_crop_fname = np.random.choice(face_crop_fnames[rand_pair_idx])
    # read both crops
    pilimg1 = read_pilimg(cropped_path/crop_fname)
    pilimg2 = read_pilimg(cropped_path/rand_pair_crop_fname)
    # merge both 50:50 along the vertical seam, random left/right order
    if np.random.uniform() < 0.5:
        merged_pilimg = PIL.Image.fromarray(np.hstack([np.asarray(pilimg1)[:, :112, :],
                                                       np.asarray(pilimg2)[:, 112:, :]]))
    else:
        merged_pilimg = PIL.Image.fromarray(np.hstack([np.asarray(pilimg2)[:, :112, :],
                                                       np.asarray(pilimg1)[:, 112:, :]]))
    return merged_pilimg

Resnet34, EfficientNet b5 and EfficientNet b7; EfficientNet alone was enough to get my best score. For the ensemble I used simple averaging. I tried both with and without TTA, but it didn't give much of a boost. All models were finetuned using gradual unfreezing and the 1-cycle policy. Focal loss was used at a later stage of training to update the model with harder samples. I also used simple clipping (0.01-0.99) on the final probability predictions, just to be safe against very confident false positives and false negatives.
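The averaging-plus-clipping step is simple; a minimal sketch (the function name is mine):

```python
import numpy as np

def ensemble_predict(model_probs, clip=(0.01, 0.99)):
    "Average per-model fake probabilities, then clip to limit log-loss damage"
    avg = np.mean(np.asarray(model_probs), axis=0)
    return np.clip(avg, *clip)

# e.g. three models scoring two videos
probs = [[0.999, 0.2],
         [0.980, 0.1],
         [1.000, 0.3]]
print(ensemble_predict(probs))  # [0.99 0.2 ]
```

Clipping caps the worst-case per-sample loss at -log(0.01) ≈ 4.6 instead of infinity, which matters on a hidden test set where the label distribution may differ from the public one.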
As of writing this I am at 199th place out of 2281 participants on the Public Leaderboard. Let's see how things turn out on the Private Leaderboard once the Gods of Shake-Up speak.
- LRCN-based modeling didn't work and suffered from severe overfitting.
- Sampling 1 real video and all its fakes also didn't give consistent results compared to balanced sampling.
I will be updating this blog post as top solutions share their approaches and once the Private Leaderboard is revealed.